feat: streaming in speech to text transcription #168
base: v0.4.0-rc1
Conversation
Force-pushed 72b2505 to 0f270a7
Resolve conflicts; I will test it tomorrow, but it looks OK in general.
Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update (improves or adds clarity to existing documentation)

Tested on

- [x] iOS
- [x] Android

Checklist

- [x] I have performed a self-review of my code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have updated the documentation accordingly
- [ ] My changes generate no new warnings
Force-pushed 86cde69 to f6f8f31
improved error handling, minor refactoring
Force-pushed f6f8f31 to a4c262a
Force-pushed ddec73b to 135eb72
Some questions, but my main issue is including speech-to-text code in the LLM example.
Do we need audio streaming in the LLM example app? Maybe it would be more suitable in the speech-to-text app? Right now, over half of the file has nothing to do with the LLM feature it should showcase.
```ts
      return '';
    await this.encode(chunk);
  } catch (error) {
    this.onErrorCallback?.(new Error(getError(error) + ' encoding error'));
```
This regards all the `this.onErrorCallback` calls in this file. What if the dev doesn't pass `onErrorCallback` in the constructor? It would make all those errors disappear into thin air. We may use something like what is suggested in #183. Maybe even define `this.errorCallback` in the constructor, like:

```ts
this.errorCallback = (e: unknown) => {
  if (this.onErrorCallback) {
    this.onErrorCallback(getError(e));
  } else {
    throw new Error(getError(e));
  }
};
```

or some variation. We may also require passing `errorCallback`, or ignore this issue altogether, but I think that would result in poor DX.
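A standalone sketch of that fallback pattern (the `Transcriber` class name and the `getError` helper here are hypothetical stand-ins for the PR's actual code, shown only to illustrate resolving the fallback once in the constructor):

```typescript
// Hypothetical stand-in for the PR's getError: stringify unknown errors.
const getError = (e: unknown): string =>
  e instanceof Error ? e.message : String(e);

class Transcriber {
  private errorCallback: (e: unknown) => void;

  constructor(onErrorCallback?: (message: string) => void) {
    // Decide the fallback once: forward to the dev's callback if provided,
    // otherwise rethrow so errors never vanish silently.
    this.errorCallback = onErrorCallback
      ? (e) => onErrorCallback(getError(e))
      : (e) => {
          throw new Error(getError(e));
        };
  }

  report(e: unknown) {
    this.errorCallback(e);
  }
}
```

With this shape, every call site just invokes `this.errorCallback(error)` and does not need to re-check whether a callback was supplied.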
```ts
private trimLeft(numOfTokensToSlice: number) {
  for (let idx = 1; idx < this.seqs.length; idx++) {
    if (this.seqs[idx]![0] === this.config.tokenizer.bos)
      this.seqs[idx] = this.seqs[idx]!.slice(numOfTokensToSlice);
  }
}

private trimRight(numOfTokensToSlice: number) {
  for (let idx = 0; idx < this.seqs.length - 1; idx++) {
    if (this.seqs[idx]!.at(-1) === this.config.tokenizer.eos)
      this.seqs[idx] = this.seqs[idx]!.slice(0, -numOfTokensToSlice);
  }
}

private async trimSequences(audioLanguage?: string) {
  const numSpecialTokens = (await this.getStartingTokenIds(audioLanguage))
    .length;
  this.trimLeft(numSpecialTokens + NUM_TOKENS_TO_SLICE);
  this.trimRight(numSpecialTokens + NUM_TOKENS_TO_SLICE);
}
```
We call `trimSequences` in the loop where we push the new seq. Wouldn't it be enough then (returning the trimmed array, since `slice` does not mutate):

```ts
private trimLeft(newSeq: number[], numOfTokensToSlice: number): number[] {
  if (newSeq[0] === this.config.tokenizer.bos)
    return newSeq.slice(numOfTokensToSlice);
  return newSeq;
}

private trimRight(newSeq: number[], numOfTokensToSlice: number): number[] {
  if (newSeq.at(-1) === this.config.tokenizer.eos)
    return newSeq.slice(0, -numOfTokensToSlice);
  return newSeq;
}

private async trimSequences(newSeq: number[], audioLanguage?: string) {
  const numSpecialTokens = (await this.getStartingTokenIds(audioLanguage))
    .length;
  newSeq = this.trimLeft(newSeq, numSpecialTokens + NUM_TOKENS_TO_SLICE);
  return this.trimRight(newSeq, numSpecialTokens + NUM_TOKENS_TO_SLICE);
}
```
It almost feels like you can do without `trimLeft` and `trimRight` entirely and just put it all in one function.
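A single-function version of that idea could look like this (standalone sketch: `bos`, `eos`, and the slice count are plain parameters here instead of reads from `this.config`, and the trimmed sequence is returned because `slice` is non-mutating):

```typescript
// Sketch: strip special tokens off a newly decoded sequence in one pass.
// bos/eos token ids and numOfTokensToSlice are assumed inputs.
function trimSequence(
  newSeq: number[],
  bos: number,
  eos: number,
  numOfTokensToSlice: number
): number[] {
  let seq = newSeq;
  // Drop leading special tokens if the sequence starts with BOS.
  if (seq[0] === bos) seq = seq.slice(numOfTokensToSlice);
  // Drop trailing special tokens if the sequence ends with EOS.
  if (seq.at(-1) === eos) seq = seq.slice(0, -numOfTokensToSlice);
  return seq;
}
```

The caller would then do `this.seqs.push(trimSequence(seq, bos, eos, n))` instead of trimming the whole `this.seqs` array on every push.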
This code is not strictly correct, since we trimRight the second-to-last seq and trimLeft the last seq, but yes, I can simplify this.
```ts
this.waveform = this.waveform.slice(
  this.windowSize - this.overlapSeconds * Number(this.seqs.length === 0)
);
const seq = await this.decodeChunk(chunk, audioLanguage);
```
We decode it online. What happens if decoding is slower than the rate at which new samples come in? Is that a realistic case, or does it never happen?
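For reference, one common way to make that case safe is to serialize decoding through a queue, so a slow decode backs up a buffer instead of racing with incoming samples. This is a generic sketch, not the PR's code; the `DecodeQueue` name and the injected `decode` callback are hypothetical:

```typescript
// Sketch of a serialized decode queue: chunks are buffered and decoded one
// at a time, so slow decoding delays output rather than dropping samples.
class DecodeQueue<T> {
  private queue: T[] = [];
  private draining = false;

  constructor(private decode: (chunk: T) => Promise<void>) {}

  push(chunk: T) {
    this.queue.push(chunk);
    void this.drain(); // fire-and-forget; errors should be handled in decode
  }

  private async drain() {
    if (this.draining) return; // a drain loop is already running
    this.draining = true;
    while (this.queue.length > 0) {
      await this.decode(this.queue.shift()!);
    }
    this.draining = false;
  }
}
```

The trade-off is latency: if decoding is persistently slower than real time, the queue grows without bound, so in practice you would also cap its length or drop the oldest chunks.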